KNN Project Solution

Since KNN is such a simple algorithm, we will just use this "Project" as a simple exercise to test your understanding of the implementation of KNN. By now you should feel comfortable implementing a machine learning algorithm in R, as long as you know what library to use for it.

So for this project, just follow along with the bolded instructions. It should be very simple, so at the end you'll have an additional optional "bonus" project.

Get the Data

Iris Data Set

We'll use the famous iris data set for this project. It's a small data set with flower features that can be used to attempt to predict the species of an iris flower.

Load the ISLR library, then check the head of the iris data frame. (The iris data set actually ships with base R's datasets package, so it is available even without ISLR.)

In [1]:
library(ISLR)
In [15]:
head(iris)
Out[15]:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9           3          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5            5         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
In [16]:
str(iris)
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Standardize Data

In this case, the iris data set has all its features in the same order of magnitude, but it's good practice (especially with KNN, which relies on distances between points) to standardize the features in your data. Let's go ahead and do this even though it's not necessary for this data!

Use scale() to standardize the feature columns of the iris dataset. Set this standardized version of the data as a new variable.

In [3]:
stand.features <- scale(iris[1:4])

Check that the scaling worked by checking the variance of one of the new columns.

In [4]:
var(stand.features[,1])
Out[4]:
1
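
As a cross-check on what scale() is doing, the following is a minimal sketch that standardizes one column by hand (subtract the mean, divide by the standard deviation) and compares it with the scale() result. The names z.manual and z.scale are introduced here purely for illustration.

```r
# Standardize one column by hand: subtract the mean, divide by the standard deviation
x <- iris$Sepal.Length
z.manual <- (x - mean(x)) / sd(x)

# scale() returns a matrix; pull out the same column for comparison
z.scale <- as.numeric(scale(iris[1:4])[, "Sepal.Length"])

all.equal(z.manual, z.scale)   # do the two approaches agree?
mean(z.manual)                 # should be numerically close to 0
var(z.manual)                  # should be numerically 1
```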

Join the standardized data with the response/target/label column (the column with the species names).

In [5]:
final.data <- cbind(stand.features,iris[5])
In [6]:
head(final.data)
Out[6]:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   -0.8976739    1.015602    -1.335752   -1.311052  setosa
2      -1.1392  -0.1315388    -1.335752   -1.311052  setosa
3    -1.380727   0.3273175    -1.392399   -1.311052  setosa
4     -1.50149  0.09788935    -1.279104   -1.311052  setosa
5    -1.018437     1.24503    -1.335752   -1.311052  setosa
6    -0.535384    1.933315    -1.165809   -1.048667  setosa

Train and Test Splits

Use the caTools library to split your standardized data into train and test sets. Use a 70/30 split.

In [7]:
set.seed(101)

library(caTools)

sample <- sample.split(final.data$Species, SplitRatio = .70)
train <- subset(final.data, sample == TRUE)
test <- subset(final.data, sample == FALSE)
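
Because the split is done on the Species labels, each species should end up with roughly 70% of its rows in the training set. The sketch below illustrates that idea of a stratified split using only base R. It is an assumption-laden stand-in for illustration, not a reimplementation of sample.split, and it will not select the same rows; train.idx, train.demo and test.demo are hypothetical names.

```r
set.seed(101)

# Group the row numbers by species, then draw 70% of the rows within each group
rows.by.species <- split(seq_len(nrow(iris)), iris$Species)
train.idx <- unlist(lapply(rows.by.species,
                           function(rows) sample(rows, size = round(0.7 * length(rows)))))

train.demo <- iris[train.idx, ]
test.demo  <- iris[-train.idx, ]

# Each species contributes the same share to both sets
table(train.demo$Species)
table(test.demo$Species)
```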

Build a KNN Model

Call the class library.

In [8]:
library(class)

Use the knn function to predict the Species of the test set. Use k = 1.

In [9]:
predicted.species <- knn(train[1:4],test[1:4],train$Species,k=1)
In [10]:
predicted.species
Out[10]:
  1. setosa
  2. setosa
  3. setosa
  4. setosa
  5. setosa
  6. setosa
  7. setosa
  8. setosa
  9. setosa
  10. setosa
  11. setosa
  12. setosa
  13. setosa
  14. setosa
  15. setosa
  16. versicolor
  17. versicolor
  18. versicolor
  19. versicolor
  20. versicolor
  21. virginica
  22. versicolor
  23. versicolor
  24. versicolor
  25. versicolor
  26. versicolor
  27. virginica
  28. versicolor
  29. versicolor
  30. versicolor
  31. virginica
  32. virginica
  33. virginica
  34. virginica
  35. virginica
  36. virginica
  37. virginica
  38. virginica
  39. virginica
  40. virginica
  41. virginica
  42. virginica
  43. virginica
  44. virginica
  45. virginica
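
To make concrete what a KNN prediction involves, here is a minimal base-R sketch that classifies a single point: compute the Euclidean distance to every training row, take the k closest, and return the most common label. The function knn.one is a hypothetical helper written for illustration and is not part of the class package. Unlike class::knn, which breaks voting ties at random, this sketch simply takes the first factor level among tied labels.

```r
# Hypothetical helper: classify one point by majority vote among its k nearest neighbours
knn.one <- function(train.x, train.labels, new.point, k = 1) {
  # Euclidean distance from new.point to every training row
  diffs <- sweep(as.matrix(train.x), 2, as.numeric(new.point))
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training rows
  nearest <- train.labels[order(dists)[1:k]]
  # Most frequent label among them (ties go to the first factor level)
  names(which.max(table(nearest)))
}

features <- scale(iris[1:4])

# Classify row 150 using the other 149 rows as training data
knn.one(features[-150, ], iris$Species[-150], features[150, ], k = 3)
```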

What was your misclassification rate?

In [11]:
mean(test$Species != predicted.species)
Out[11]:
0.0444444444444444
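
A single error rate does not show which species are being confused with each other. One way to see that is a confusion matrix built with table(). The sketch below is self-contained, so it rebuilds a 70/30 split in base R instead of using sample.split; that is an assumption made for illustration, and its numbers will therefore not match the output above. The names pred, actual and conf.mat are hypothetical.

```r
library(class)

set.seed(101)
features <- scale(iris[1:4])

# Base-R stratified split: 35 training rows per species (stand-in for sample.split)
train.idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           function(rows) sample(rows, 35)))

pred   <- knn(features[train.idx, ], features[-train.idx, ],
              iris$Species[train.idx], k = 1)
actual <- iris$Species[-train.idx]

# Rows are predictions, columns are true species; off-diagonal cells are errors
conf.mat <- table(predicted = pred, actual = actual)
conf.mat

# The misclassification rate recovered from the table
1 - sum(diag(conf.mat)) / sum(conf.mat)
```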

Choosing a K Value

Although our data is quite small for us to really get a feel for choosing a good K value, let's practice.

Create a plot of the error (misclassification) rate for k values ranging from 1 to 10.

In [12]:
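predicted.species <- NULL
error.rate <- NULL

# Fit KNN for each k from 1 to 10 and record the test error
for(i in 1:10){
    # Reset the seed so any random tie-breaking in knn is reproducible
    set.seed(101)
    predicted.species <- knn(train[1:4],test[1:4],train$Species,k=i)
    # Fraction of test rows whose predicted species is wrong
    error.rate[i] <- mean(test$Species != predicted.species)
}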
In [13]:
library(ggplot2)
k.values <- 1:10
error.df <- data.frame(error.rate,k.values)
In [14]:
pl <- ggplot(error.df,aes(x=k.values,y=error.rate)) + geom_point()
pl + geom_line(lty="dotted",color='red')

You should have noticed that the error drops to its lowest for k values between 2 and 6, then begins to jump back up again. This is due to how small the data set is. At k = 10, k is approaching 10% of the 105 training observations, which is quite large.
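
With only 45 test rows, each misclassified flower moves the error rate by more than two percentage points, so the curve is noisy. One possible way to get a steadier picture, sketched here as an optional aside, is leave-one-out cross-validation with knn.cv from the class library, which classifies every row using all the other rows. The name loo.error is hypothetical, and the results will differ from the train/test plot above.

```r
library(class)

features <- scale(iris[1:4])
set.seed(101)  # knn.cv breaks ties at random, so fix the seed

# Leave-one-out error for k = 1..10: each row is classified by its k nearest other rows
loo.error <- sapply(1:10, function(k) {
  mean(knn.cv(features, iris$Species, k = k) != iris$Species)
})

data.frame(k = 1:10, loo.error)
```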

Optional Assignment

You should feel pretty comfortable using KNN since it's so simple. As an optional assignment, choose a data set from the UCI Machine Learning Repository and see if you can use the process laid out above to do your own classification!

Check out the data sets here

Great Job!